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Abstract: In the series of our earlier papers on the subject, we proposed 
a novel statistical hypothesis testing method for detection of objects in 
noisy images. The method uses results from percolation theory and ran- 
' dom graph theory. We developed algorithms that allowed to detect objects 

of unknown shapes in the presence of nonparametric noise of unknown level 
and of unknown distribution. No boundary shape constraints were imposed 
on the objects, only a weak bulk condition for the object's interior was re- 
^ ■ quired. Our algorithms have linear complexity and exponential accuracy. 

In the present paper, we describe an implementation of our nonparamet- 
ric hypothesis testing method. We provide a program that can be used for 
00 , statistical experiments in image processing. This program is written in the 

statistical programming language R. 

<N 

C "3 ■ Keywords and phrases: Image analysis, signal detection, image recon- 

struction, percolation, noisy image, unsupervised machine learning, spatial 
statistics. 



1. Introduction 



Assume we observe a noisy digital image on a screen of N x N pixels. Ob- 
ject detection and image reconstruction for noisy image s are two of the corner- 



stone p r oblems in image an a lysis. In the series of papers lLangovov a nd WittichJ 

(|2009al ). lDavies et al.l|2009t) . lLangovov and Wittichl (|2009bl ) and lLangovov and Wittich 



2010h . we proposed a new efficient technique for quick detection of objects in 



noisy images. Our approach uses mathematical percolation theory. 

Detection of objects in noisy images is the most basic problem of image analy- 
sis. Indeed, when one looks at a noisy image, the first question to ask is whether 
there is any object at all. This is also a primary question of interest in such 
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diverse fields as, for exa mple, cancer detec tion ( Ricci- Vitiani et al. ( 2007h ). au 



tomated urban analysis dNeeri et al" ( 20061) 1. detection of cracks in buried pipes 



(ISinha and Fieeutbl ((2006))," and other possible applications in astronomy, elec- 
tron microscopy and neurology. Moreover, if there is just a random noise in the 
picture, it doesn't make sense to run computationally intensive procedures for 
image reconstruction for this particular picture. Surprisingly, the vast majority 
of image analysis methods, both in statistics and in engineering, skip this stage 
and start immediately with image reconstruction. 

The crucial difference of our method is that we do not impose any shape 
or smoothness assumptions on the boundary of the object. This permits the 
detection of nonsmooth, irregular or disconnected objects in noisy images, under 
very mild assumptions on the object's interior. This is especially suitable, for 
example, if one has to detect a highly irregular non-convex object in a noisy 
image. This is usually the case, for example, in the aforementioned fields of 
automated urban analysis, cancer detection and detection of cracks in materials. 
Although our detection procedure works for regular images as well, it is precisely 
the class of irregular images with unknown shape where our method can be very 
advantageous. 

We approached the object detection problem as a hypothesis testing problem 
within the class of statistical inverse problems in spatial statistics. We were able 
to extend our approach to the nonparametric case of unknown noise density, 
and this density was not assumed smooth or even continuous. It is even possible 
that the noise distribution doesn't have a density or that the noise distribution 
is heavy-tailed. 

We have shown that there is a deep connection between the spatial structure 
chosen for the discretisation of the image, the type of the noise distribution on 
the image and statistical properties of object detection. These results seem to 
be of independent interest for the field of spatial statistics. 

In ou r pre yious papers, we considered the case of square lattices in Langovov and Wittier] 



( 2009a ) an d lLangovov and Wittichl (l2009bh . triangular lattices in lDavies et al 



20091 ) and Langovov and Wittichl (120101) and even the case of general periodic 



lattices in lLangovov and Wittichl ((2010h . In all those cases, we proved that our 



detection algorithms have linear complexity in terms of the number of pixels 
on the screen. These procedures are not only asymptotically consistent, but on 
top of that they have accuracy that grows exponentially with the " number of 
pixels" in the object of detection. All of the detection algorithms have a built-in 
data-driven stopping rule, so there is no need in human assistance to stop an 
algorithm at an appropriate step. 

In view of the above, our method can be viewed as an unsupervised learning 
method, in the language of machine learning. This makes our results valuable 
for the field of machine learning as well. Indeed, we do not only propose an 
unsupervised method, but also prove the method's consistency and even go as 
far as to prove the rates of convergence. 

In this paper, we describe an algorithmic implementation of our nonpara- 
metric hypothesis testing / image processing method. We also provide an ac- 
tual program that was used for our research at the initial stages. This program 
is written in the statistical programming language R. It is known that, as a 
programming language, R is rather slow. At present, we have implemented our 
algorithms in CH — h, and they run substantially faster. However, R is a free and 
convenient language for statistical programming, since R has many advanced 
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statistical procedures already built-in. This language also has convenient tools 
for generating random variables of various types, therefore R is suitable for 
statistical experiments with our method. 

The paper is organized as follows. Section 2.1 describes some basic methods 
for converting, storing and processing images in R. Our implementation of lat- 
tice neighborhood structures is given in Section 2.2. The Depth First Search 
algorithm and its R-implementation for image analysis are described in Section 
3. In Section 4 we explain how the fast Newman-Ziff algorithm from discrete 
algorithmic probability can be effectively used in our image analysis method. 
Subsection 4 describes the theory behind the Newman-Ziff approach, while Sub- 
section 4.1 introduces our modification of this algorithm, suitable for percola- 
tion probabilities that are inhomogeneous over the lattice. This modification 
allows for finer simulation studies for any given finite screen size, and this is 
particularly useful for small images. Subsections 4.2 and 4.3 are devoted to R- 
implementations of the Newman-Ziff algorithm and the modified Newman-Ziff 
algorithms respectively. 

2. General remarks about image processing in R 
2.1. Converting, storing, and processing images 

Image processing in R uses the package r image which allows to convert an actual 
picture (in .jpg-format) to an R-object called imagematrix. In the sequel, 
we store images as vectors of grey-values where the entries of the vector are 
intensities of the corresponding pixels going through the image row by row. The 
vectors arc manipulated and converted to imagematrices which can finally be 
displayed. 

## SOME CLUSTER MACROS 
# 

## use only with "library (rimage) " 

## Some functions converting data structures 

# 

# 

# the functions convert vectors in matrices and imagematrices; 

# they work properly only if the components are either or 1 
# 

## Function: Image to Matrix (img2mat) 

img2mat <- f unction(img) 
{ 

mat <- matrix(img,nrow(img) ,ncol(img)) 

return (mat) 

} 

## Function: Matrix to Image (mat2img) 

mat2img <- function (mat) 
{ 
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img <- imagematrix(mat) 

return (img) 

} 

## Function: Matrix to Vector (mat2vec) 

mat2vec <- function (mat) 
{ 

vec <- as. vector (mat) 

return(vec) 

} 

## Function: Vector to Matrix (vec2mat) 

vec2mat <- function(vec,n,m) 
{ 

mat <- matrix(vec,n,m) 

return (mat) 

} 

The following code just provides you with an example of generating a graphical 
representation of site percolation on a lattice: 

## EXAMPLE: Percolation 

# simulates site? percolation on an n x n lattice 

# with parameter <= p <= 1 

perco <- function (p,n) 
{ 

v <- rbinom(n*n, 1 ,p) 
ma <- vec2mat(v,n) 
im <- mat2img(ma) 
return(im) 
} 

2.2. Neighborhood structures 

For the percolation cluster analysis, it is important to implement the topology 
of the lattice, i.e. the neighborhood structure. Since the image - i.e. the lattice 
site intensities - are stored as a vector, the neighborhood structures are stored 
as list-objects. Assigned to every position in the vector (i.e. every site, the 
list-index) there is a vector of other vector positions, i.e. the indices of the 
neighboring sites. These lists depend on the size of the rectangular grid and have 
to be computed only once. Below we give an implementation for the standard 
cases of 4-, 6-, and 8-neighborhoods on a rectangular grid. 

###### Function: Adjacency Matrix (adj_mat) 

# 

# 

# the function generates a list of all neighbors 

# of points in a flat rectangular grid consisting of 



imsart-generic ver. 2007/04/13 file: Statistical_Image_Analysis_in_R.tex date: January 13, 2013 



M. Langovoy and O. Wittich/ Image processing in R. 



5 



# n: columns, m :rows; it works properly only for 

# for n,m > 2 ! 

## 4-neighborhood 

adj4mat <- f unction (n,m) 
{ 

neighbors <- listO 

## corners 

neighbors [ [1] ] 
neighbors [ [n] ] 
neighbors [ [n*m] ] 
neighbors [ [(m-l)*n + 

## boundary 

# (a) above 

ford in 2 : (n-1) ) {neighbors [ [i] ] <- c(i-l,i+l,i+n)> 

# (b) below 

ford in ( (m-1) *n+2) : (n*m-l) ) {neighbors [ [i] ] <- c(i-l,i+l,i-n)} 
### (c) left 

ford in 1 : (m-2) ) {neighbors [ [n*i+l] ] <- c(n*(i-l)+l,n*(i+l)+l, 

n*i+2)} 

### (d) right 

ford in 2 : (m-1) ) {neighbors [ [n*i] ] <- c(n*d-l) ,n*d+l) ,n*i-l)} 

## interior 

ford in 1: (m-2)) 
{ 

for(j in 1: (n-2)) 
{ 

neighbors [[i*n + 1 + j]]<- c(i*n + j,i*n + 2 + j , (i-1) *n+l+j , 

(i+l)*n+l+j) 

} 
} 

return (neighbors) 
} 
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## 8-neighborhood 

adj8mat <- f unction (n,m) 
{ 

neighbors <- listO 

## corners 

neighbors [ [1] ] 
neighbors [ [n] ] 
neighbors [ [n*m] ] 
neighbors[[(m-l)*n + 

## boundary 

# (a) above 

for(i in 2 : (n-1) ) {neighbors [ [i] ] <- c (i-1 , i+1 , i+n, i+n-1 , i+n+1) } 

# (b) below 

for(i in ((m-l)*n+2) : (n*m-l)) {neighbors [[i]]<- c(i-l , i+1 , i-n, 

i-n-l,i-n+l)> 

### (c) left 

ford in 1 : (m-2) ) {neighbors [ [n*i+l] ] <- c(n*(i-l)+l,n*(i-l)+2, 

n*(i+l)+l,n*(i+l)+2,n*i+2)} 

### (d) right 

for(i in 2 : (m-1) ) {neighbors [ [n*i] ] <- c(n*(i-l), n* (i-1) -1 ,n* (i+1) , 

n*(i+l)-l,n*i-l)} 

## interior 

ford in 1: (m-2)) 
{ 

for(j in l:(n-2)) 
{ 

neighbors [[i*n + 1 + j]]<- c(i*n + j,i*n + 2 + j , (i-1) *n+l+j , 

(i-l)*n+j , (i-l)*n+j+2, (i+l)*n+l+j, (i+l)*n+2+j, (i+l)*n+j) 

} 
> 

return (neighbors) 
> 
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## 6-neighborhood (triangular lattice) 

adj6mat <- f unction (n,m) 
{ 

neighbors <- listO 

## corners 

neighbors [ [1] ] 
neighbors [ [n] ] 
neighbors [ [n*m] ] 
neighbors [ [(m-l)*n 

## boundary 

# (a) above 

ford in 2 : (n-1) ) {neighbors [ [i] ] <- c (i-1 , i+1 , i+n, i+n-1) } 

# (b) below 

ford in ( (m-1) *n+2) : (n*m-l) ) {neighbors [ [i] ] <- c(i-l,i+l,i-n, 

i-n+1)} 

### (c) left 

ford in 1 : (m-2) ) {neighbors [ [n*i+l] ] <- c(n*(i-l)+l,n*(i-l)+2, 

n*(i+l)+l,n*i+2)} 

### (d) right 

ford in 2 : (m-1) ) {neighbors [ [n*i] ] <- c(n*(i-l) ,n*(i+l) ,n*(i+l)-l, 

n*i-l)} 

## interior 

ford in 1: (m-2)) 
{ 

for(j in 1: (n-2)) 
{ 

neighbors [[i*n + 1 + j]]<- c(i*n + j,i*n + 2 + j , (i-1) *n+l+j , 

(i-l)*n+j+2, (i+l)*n+l+j, (i+l)*n+j) 

} 
> 

return (neighbors) 



<- c(2,n+l,n+2) 
<- c(n-l,2*n,2*n-l) 
<- c(n*(m-l) ,n*m -1) 
+ 1]] <- c((m-l)*n + 2, (m-2)*n+l, (m-2)*n+2) 



imsart-generic ver. 2007/04/13 file: Statistical_Image_Analysis_ln_R.tex date: January 13, 2013 



M. Langovoy and O. Wittich/ Image processing in R. 



8 



3. Tarjan's Depth First Search algorithm 

In order to be able to perform statistical analysis of percolation clusters, we 
need an algorithm that finds those clusters in the threshol ded binary im age. 
For this purpose, we used the Depth First Search algorithm ( Tarianl ( 1972h ). It 



is remarkable that this algorithm has linear complexity, in terms of the size of 
the input data. In our particular problem, the algorithm finds the clusters in 
a number of steps that is linear with respect to the number of pixels on the 
screen. 

The key point of the Depth First Search algorithm is the construction of a 
so called spanning tree of a given cluster. Starting from a seed point with label 
1 that belongs to the cluster, we begin to explore all neighbors in an order 
determined by the ordering of the corresponding vector given in the list that 
encodes the neighborhood structure. The first neighboring site which belongs to 
the cluster gets label 2 and we repeat the procedure with this site. The labels 
are stored. Given we have arrived after some steps at a site with label n, the 
procedure is continued as follows 

1. Explore the neighbors of the current site: 

• The first neighbor that is so far unlabeled and belongs to the cluster 
gets label n+1, we store the current site (with label n) as mother-site 
of the new site and we proceed with the new site as current site. 

• If there is no neighbor which is unlabeled so far and belongs to the 
cluster, we return to the mother-site (backtracking) as current site. 

2. The algorithm terminates, if all neighbors of the seed point are explored. 

The cluster is now given by all labeled sites which form a tree (i.e. the spanning 
tree) whose root is the seed point and the different labels display the distance of 
a given site to the root. 

Remark on running time. Please note that the subsequent R-routines are 
extremely slow. A corresponding C-program called from R yields much better 
results. 



3.1. R-Code: Depth First Search 

The function that generates a spanning tree for the cluster with seed point r in 
a given configuration x € {0, l} axfc is given by 

## Function: Spanning Tree (spanning_tree) 

spanning_tree <- function(x,a,b,r) 
{ 

###### DEPTH FIRST SEARCH 
n <- a 
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m <- b 

mother <- c() 
fig <- 

rank <- x 

ngh <- adj6mat (n,m) 

## starting point 

start <- r 

current_rnk <- 1 
current_pos <- start 

rank [start] <- current_rnk 



while (current_rnk > 0) 



L <- length (ngh [ [current_pos] ] ) 
i <- 1 

#### check neighbors 

while (i <= L) 
{ 

if (rank [ngh [[current_pos]] [i]] != 0){i <- i+1} 
else{## mother (new vertex found) 

mother [ngh [ [current_pos] ] [i] ] <- current_pos 
current_pos <- ngh [ [current_pos] ] [i] 
current_rnk <- current_rnk + 1 
rank [current_pos] <- current_rnk 
i <- L + 1 

fig <- 1 ## found something new 

} 

} 

### nothing found ?? 

if (fig == 0) 
{ 

### backtracking 

current_pos <- mother [current_pos] 
current_rnk <- current_rnk - 1 
} 
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else 

{fig <- 0} 
> 

return (rank) 
} 

The function returns a vector of zeros for points that do not belong to the cluster 
and numbers 1,2,... of labels attached to the cluster points by the Depth First 
Search algorithm. 



4. The Newman - Ziff algorithm 



The Newman - Ziff algorithm (jNewman and Zifi (l2000h is an algorithm to simu- 



late the distribution of (functions of) the size of the largest cluster in percolation 
theory. In our image analysis method, we use this algorithm to study the null 
hypothesis, to find critical cluster sizes for object detection and to estimate the 
power of our test. In addition, this algorithm is extremely useful for controlling 
the false detection rate and for studying the finite sample performance of our 
method. 

We will explain the Newman - Ziff approach for site percolation since only 
minor changes have to be made to cover bond percolation as well. Let now Q 
be a finite graph with S = \G\ sites and < p < 1 the probability that an 
individual site is active. Two sites s, s' € Q are neighbors if the graph contains a 
bond b(s, s') joining them, a subset of active sites C C Q is called a (connected) 
cluster if any two sites in C can be connected by a chain of neighboring active 
sites. Thus, the notion of connectedness is determined by the configuration of 
bonds between sites in the graph, i.e. by the topology of Q. 

We will now concentrate on the random variable given by the maximum cluster 
size 

M := max{|C| : C C Q cluster}. (1) 

The basic idea of the algorithm is now to split up the probabilities using the 
random variable 

N := \{s e Q : s active} |. (2) 

given by the number of active sites. Then 
s 

P(M = m ) = ^2 P ( M = m\N = n) P{N = n) (3) 

and since the sites are active mutually independent with probability p, the num- 
ber of active sites is binomially distributed, i.e. 

F(N = n) = ( I ) P n (l-p) s - n . 
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Furthermore, the conditional probabilities p mn := P(M = m \ N — n) depend 
solely on the topology of the graph. The activation probability p influences the 
distribution of M via adjusting the number N of active bonds alone. 

The Newman - Ziff algorithm is an algorithm to simulate those conditional prob- 
abilities pmn- Let for fixed n denote M n £ {0, 1, ...,n} be the random variable 
that is distributed by p nm - Then we can consider all M n simultaneously as being 
generated by random permutations U of {1, S}: 

We consider a permutation ir as a random visiting scheme of the sites of Q and 
7r (1 ),..., ir(n) to be the first n sites that are visited. Considering these sites as 
active, the set ...,w(n)} consists of several connected clusters. We denote 

by X n (ir) the size of the largest of these clusters. 

Lemma 1. The distribution of X n (n) coincides with the one of M n for all n. 
Proof: A random permutation is given by P(II = n) = 1/5!, hence 

P(n(l)=7r(l),..,n(n)=7r(n)) = 

and all possible configurations of n active sites are equi-distributed. This is the 
same if you choose active sites independently with probability p and consider 
the conditional probability having n active sites. ■ 

The advantage of this is twofold: First of all, we have to generate only one ran- 
dom permutation to get estimators M n for all n = 1, S, and, crucially, we can 
add active sites inductively according to the random visiting scheme generated 
by the permutation which makes it much easier and less storage consuming to 
keep track of the largest cluster. 

Thus, one single run of the estimation procedure consists of 

1. let M = 0, 

2. generate a random permutation, 

3. add active sites inductively according to the random visiting scheme gen- 
erated by the permutation, 

4. after adding site n, check whether this site leads to enlargement of one or 
the merging of several already existing clusters, 

5. check whether the size m of the new cluster generated by adding site n is 
larger than M„_i, 

6. if so, let M n — m, otherwise M n = M„_i, 

7. record M n , n = l, S. 

For practical implementations, the active sites are recorded by assigning a la- 
bel according to which cluster the site belongs (0: not active, 1: belonging to 
cluster 1,...). These labels have to be updated every time a site is added. This 
requires some work if adding of a new site results in merging several different 
clusters. Thus, in the course of one single run (meaning a run according to one 
generated random permutation), we also have to record the current label vector 
L £ {0, 1, S} s . The conditional probabilities can then be estimated by car- 
rying out several runs of this scheme and using the strong law of large numbers 
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for suitably chosen indicator functions as usual. This holds for the conditional 
probabilities as well as for the unconditional distribution of the maximum clus- 
ter size after convoluting with the binomial distribution of the number of active 
sites above. 



4-1. The modified Newman - Ziff algorithm 



To simulate the type II error for the case of small images, in lLangovov and Wittich 



(2010) we have used a modified version of the Newman - Ziff algorithm suitable 



for site occupation probabilities which are inhomogeneous over the lattice. The 
modified Newman - Ziff algorithm computes the distribution of the magnitudes 
of the largest clusters when the site occupation probability is inhomogeneous. 

To be precise, we have a subgrid Q 1 C Q where the occupation probability is 
Pi, whereas it is pi for sites in Q — Q' . In that case, we have to store the largest 
cluster sizes for all configurations (N,M) = (n, m) of n < \G'\ occupied sites in 
Q' and m < \Q\ occupied sites in Q. The probabilities of those configurations are 



P((JV,M) = (n,m))= ( ^ ) ( ^ yUl-P^'^PTil-P*) 



\S\— m 



The basic idea of the algorithm remains essentially the same. We solved the 
additional problem of running through all configurations by using two different 
visiting schemes, one for inside Q' and another one for outside Q' . 



4-2. R-Code: Newman - Ziff algorithm 

The implementation consists of one main function simulation which produces 
a list of simulated distribution functions of the maximum cluster size for a rect- 
angular n x m-grid with six-neighborhood for different activation probabilities 
p (n and m, the number of runs runs and the different activation probabilities 
probs under consideration are specified in the header and have to be changed 
manually) . 

Furthermore, we have six subroutines, given by 

1. distribution which produces an empirical distribution function of the 
maximum cluster size out of the different estimators (Mq, M n ) pro- 
duced by the subsequent runs of run, 

2. generate generates a suitable vector of binomial probabilities, 

3. run one run of the Newman - Ziff algorithm as described above, generates 
the estimators (Mq, ...,M n ), 

4. rvsch generates the random visiting scheme, 

5. sixngh is our way to implement the neighborhood structure (i.e. the topol- 
ogy of the graph), 

6. sizeprob generates an unconditional empirical distribution function. 

############################ MAIN ######################### 
## 

## 'simulation' 
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## 

## uses: 'distribution' 
## 

## change 'n' , 'm' , 'runs' , 'probs' manually 

simulation <- functionO 
{ 

n <- 55 
m <- 55 
runs <- 1000 

probs <- c(0. 1,0. 2, 0.3, 0.4, 0.42, 0.44, 0.46, 
0.48,0.5,0.52,0.54,0.56,0.58, 
0.6,0.7,0.8,0.9) 

N <- n*m 

all_dist <- listO 

for(PP in 1:17) 
{ 

all_dist [ [PP] ] <- distribution (n, m, runs, probs [PP] ) 

print (PP) 

} 

return (all_dist) 
} 



## 

## 'distribution' generates estimate of cdf 
## 

## uses: 'generate' , 'sixngh' , 'sizeprob' , 'run' 

## used by: 'simulation' 

## 

distribution <- function(n,m,runs,p) 
{ 

N <- n*m 

ngh <- sixngh(n,m) 
probs <- rep(0,N) 

binprob <- generate (N,p) 

for(r in l:runs) 
{ 

size <- run(n,m,ngh) 

sizeprob <- sizeprob(size , binprob) 



imsart-generic ver. 2007/04/13 file: Statistical_Image_Analysis_ln_R.tex date: January 13, 2013 



M. Langovoy and O. Wittich/ Image processing in R. 14 

probs <- probs + sizeprob 
} 

probab <- probs/runs 

return (probab) 

} 



########## 
## 

## 'generate' generates vector of binomial probabilities 
## 

## input: n 'number of trials', p 'success probability' 
## output: Bin(n,p) probability vector 
## used by: distribution 

generate <- f unction (n,p) {outcomes <- c(0:n) 

probs <- dbinom (outcomes ,n,p) 
return(probs)} 



## 

## 'run' — "one single run of the newman ziff algorithm" 
## 

## remark: the 'updating step' uses special properties of 
## six-neighborhoods for the conditioning 

## input: n,m 'grid rows and columns' 

## ngh 'neighborhood structure, provided by 'sixngh' 

## uses: 'sixngh', 'rvsch' 

## output: size 'size[TT]: maximum cluster size for TT 
## occupied bonds' 

run <- f unction (n,m, ngh) 
{ 

## define vector 'size' and set N 'number of edges' 
size <- c() 
N <- n*m 

## generate VS <- rvsch(N) 'random visiting scheme' 

## == random permutation of 1 N 

VS <- rvsch(N) 

## initialise clus 'cluster label vector' 

## 0: site not occupied, k>0: site belongs to cluster k 

clus <- rep(0,N) 

## T 'step number' (for a single run) 
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for(TT in 1:N) 
{ 

## maximal cluster label 
maxi <- max(clus) 

## random edge 
edge <- VS[TT] 

## neighbors of random edge 
neig <- ngh[[edge]] 

## all cluster labels in neighborhood of random edge, 
## ordered by magnitude 

## lbl = 0: unoccupied edge, lbl = k: edge belongs to 
## cluster k 

## remark: a moment of reflection shows that always 
## length (labl) <=4 for the triangular lattice 

labl <- sort (unique (clus [neig] ) ) 

L <- length(labl) 

## updating labels: 

## case #1: no occupied bonds in the neighborhood 
## case #2: random edge connects all clusters in the 
## neighborhood, new label = maximum label 

if (labl [L] == 0) {clus [edge] <- 1 + maxi}else 
{ 

if(L == 1) {clus [edge] <- labl[l]}else 
{ 

for(k in 2:L) 
{ 

z <- (labl[L] - labl[k])*(l-abs(sign(clus - labl[k]))) 

clus <- clus + z 

clus [edge] <- labl [L] 

} 

} 

} 

## calculate the maximum cluster size 

## either the cluster just constructed by merging the clusters 
## in the neighborhood of the random edge is now the largest 
## cluster, or the largest cluster remains the same 

newlabel <- clus [edge] 

mergedsize <- length(clus [clus == newlabel]) 

if (TT > 1) 
{ 

if (mergedsize > size [TT-1] ) {size [TT] <-mergedsize 
}else{size [TT] <-size [TT-1] } 
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} 

else{size[l] <- 1} 

> 

## output 'size' vector 
return(size) 
} 

########## 
## 

## 'rvsch' "random visiting scheme" 
## 

## used by: 'run' 

rvsch <- function(N){ x <- c(l:N) 

z <- sample(x,N) 
return (z) 

} 

########## 
## 

## 'sixngh' 6-neighborhood (triangular lattice) 
## 

## used by: 'run' 

sixngh <- f unction(n,m) 
{ 

neighbors <- listO 

## corners 

neighbors [ [1] ] 
neighbors [ [n] ] 
neighbors [ [n*m] ] 
neighbors [ [(m-1) *n 

## boundary 

# (a) above 

ford in 2 : (n-1) ) {neighbors [ [i] ] <- c(i-l,i+l,i+n,i+n-l)> 

# (b) below 

ford in ( (m-1) *n+2) : (n*m-l) ) {neighbors [ [i] ] <- c(i-l,i+l,i-n,i-n+l)} 

### (c) left 
ford in 1: (m-2)) 

{neighbors [[n*i+l]] <- c(n*(i-l)+l,n*(i-l)+2,n*(i+l)+l,n*i+2)} 



<- c(2,n+l) 
<- c(n-l,2*n,2*n-l) 
<- c (n* (m-1) ,n*m -1) 
+ 1]] <- c((m-l)*n + 2, (m-2)*n+l, (m-2)*n+2) 
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### (d) right 

for(i in 2 : (m-1) ) {neighbors [ [n*i] ] <- c(n*(i-l) ,n*(i+l) ,n*(i+l)-l ,n*i-l)} 

## interior 
for(i in 1: (m-2)) 
{ 

for(j in 1: (n-2)) 
{ 

neighbors [[i*n + 1 + j]]<- c(i*n + j,i*n + 2 + j , (i-1) *n+l+j , 

(i-l)*n+j+2, (i+l)*n+l+j , (i+l)*n+j) 

} 
} 

return (neighbors) 
} 



########## 
## 

## 'sizeprob' generates unconditional empirical cdf 
## 

## used by: 'distribution' 

## input: k cluster size, integer 

## 'binprob' suitable binomial probabilities from 'generate' 

## 'size' vector from simulation 'run' 

## 

## output: 'estimator' vector of P(max <= k) , k = 0,...,n 

sizeprob <- f unction(size , binprob) 
{ 

estimator <- c() 
size_ext <- c(0,size) 

for(k in 1 : length(size) ) 
{ 

## 'indicator' determines all entries of 'size_ext' 
## less or equal than k 

indicator <- 0.5*(1 - sign(size - k - 0.5)) 

## 'estimator' is the estimator of P(max <= k) based 
## on the single run 

estimator [k] <- sum(c (1 , indicator) * binprob) 
} 

return (estimator) 
} 

4-3. R-Code: modified Newman - Ziff algorithm 

The modified algorithm uses the functions sixngh and rvsch from the previous 
collection and furthermore the following code. 
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########################## MAIN_MDDI ############################################## 
## 

## ' simulation_modi ' 
## 

## uses : ' run_modi ' , ' sizeprob_modi ' 
## 

## change manually: x,y gridsize, runs number of runs, s_set subgrid, 



## P_in, P_out site occupation probabilities inside and 

## outside the subgrid 

## 

## output: 'estimate' estimated cluster size distribution 



########################## MAIN_M0DI ################################## 
## 

## ' simulation_modi ' 
## 

## uses: 'run_modi' , ' sizeprob_modi ' 
## 

## change manually: x,y gridsize, runs number of runs, s_set subgrid, 



## P_in, P_out site occupation probabilities inside and 

## outside the subgrid 

## 

## output: 'estimate' estimated cluster size distribution 



simulation_modi <- functionO 
{ 

## we consider fixed grid size (square grid) with 55x55 sites 

x <- 55 
y <- 55 
n <- x*y 
runs <- 100 

## 's_set' is a vector giving the positions of sites belonging to the subset 
## positions are given by a vector, the sites in the grid are labelled as: 
## 

tttt 1 j 2 j 3 j • • • ; x 

## x+l 2x 

## 

## (y-l)x+l. . . xy 
## 

## example: lOxlO-subset left upper corner at location (20,20) 

s_set <- c(c (1065: 1074) ,c (1120: 1129), c (1175: 1184) ,c(1230: 1239) ,c (1285: 1294) , 
c(1340: 1349), c(1395: 1404), c(1450: 1459) ,c(1505: 1514) ,c(1560: 1569)) 
ngh <- sixngh(x.y) 
P_in <- 0.6 
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P_out <- 0.4 

est <- rep(0,n) 

for(LL in l:runs) 
{ 

single_sim <- run_modi(x,y,s_set,ngh) 

est <- est + sizeprob_modi (single_sim,P_in,P_out) 

print (LL) 

} 

estimate <- est/runs 

return (estimate) 
} 



## function "run_modi" — 'one single run of the modified newman ziff algorithm' 
## remark: the 'updating step' uses special properties of six-neighborhoods for 
## the conditioning 

## input: n,m 'grid rows and columns' 

## ngh 'neighborhood structure, provided by 'sixngh' 

## s_v 'positions of subset sites' 

## functions used: 'sixngh', 'rvsch' 
## 

## output: size_out (list) ' size [ [a] ] [b] : maximum cluster size for a-1 occupied bonds ins 
## and b-1 occupied bonds outside the subset' 

run_modi <- function(n,m,s_v,ngh) 
{ 

## define vector 'size' and set M 'number of edges in the subset' and 

## N number of sites in the grid 

size <- c() 

M <- length(s_v) 

N <- n*m 

## generate VS <- rvsch(M) 'random visiting scheme of subset' 
## == random permutation of 1, . . . ,M 
VS <- rvsch(M) 

## initialise clus 'cluster label vector' 

## 0: site not occupied, k>0: site belongs to cluster k 

clus <- rep(0,N) 

## initialize list of all cluster vectors 
s_clus <- listO 

## TT 'step number' (for a single run for the subset) 
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for(TT in 1:M) 
{ 

## maximal cluster label 
maxi <- max(clus) 

## random edge inside subset 
edge <- s_v[VS[TT]] 

## neighbors of random edge 
neig <- ngh[[edge]] 

## all cluster labels in neighborhood of random edge, ordered by magnitude 
## lbl = 0: unoccupied edge, lbl = k: edge belongs to cluster k 
## remark: a moment of reflection shows that always length (labl) <=4 for the 
## triangular lattice 

labl <- sort (unique (clus [neig] ) ) 

L <- length(labl) 

## updating labels: case #1: no occupied bonds in the neighborhood 
## case #2: random edge connects all clusters in the 

## neighborhood, new label = maximum label 

if (labl [L] == 0) {clus [edge] <- 1 + maxi>else 
{ 

if(L == 1) {clus [edge] <- labl[l]}else 
{ 

for(k in 2:L) 
{ 

z <- (labl[L] - labl[k])*(l-abs(sign(clus - labl[k]))) 
clus <- clus + z 
clus [edge] <- labl [L] 

} 

} 

} 

## calculate the maximum cluster size 

## either the cluster just constructed by merging the clusters in the 
## neighborhood of the random edge is now the largest cluster, or the 
## largest cluster remains the same 

newlabel <- clus [edge] 

mergedsize <- length(clus [clus == newlabel]) 

if (TT > 1) 
{ 

if (mergedsize > size [TT-1] ) {size [TT] <-mergedsize}else{size [TT] <-size [TT-1] } 
} 

else{size[l] <- 1} 
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## cluster status vector 'clus' has now to be stored for all numbers of 
## occupies sites in the subset 

s_clus[[TT]] <- clus 
} 



## output 'status' vector 's_clus' + value for active subset sites, stored in 
## s_clus[[M+l]] , 'size' == 

s_clus[[M+l]] <- rep(0,N) 
size[M +1] <- 

## second part: adding active sites to the complement of the subset 

## remark: the 'updating step' uses special properties of six-neighborhoods for 

## the conditioning 

## define vector 'o_v': complement of subset 

latt <- c(l:N) 

o_v <- setdif f (latt , s_v) 

J <- length(o_v) 

anoth_size <- c() 

## generate VS <- rvsch(L) 'random visiting scheme' 
## of the complement 
WS <- rvsch(J) 

## initialize list of all cluster sizes 
s_clussize <- listO 

## WW 'step number' for initial subset configuration 
VV <- M+l 

for (WW in 1:VV) 
{ 

clus <- s_clus[[WW]] 

## TT 'step number' (for a single run) 

for(TT in 1:J) 
{ 

## maximal cluster label 
maxi <- max (clus) 

## random edge inside subset 
edge <- o_v[WS[TT]] 

## neighbors of random edge 
neig <- ngh[[edge]] 
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## all cluster labels in neighborhood of random edge, ordered by magnitude 
## lbl = 0: unoccupied edge, lbl = k: edge belongs to cluster k 
## remark: a moment of reflection shows that always length (labl) <=4 for the 
## triangular lattice 

labl <- sort (unique (clus [neig] ) ) 

L <- length(labl) 

## updating labels: case #1: no occupied bonds in the neighborhood 
## case #2: random edge connects all clusters in the 

## neighborhood, new label = maximum label 

if (labl [L] == 0) {clus [edge] <- 1 + maxi}else 
{ 

if(L == 1) {clus [edge] <- labl [1] }else 
{ 

for(k in 2:L) 
{ 

z <- (labl[L] - labl[k])*(l-abs(sign(clus - labl[k]))) 
clus <- clus + z 
clus [edge] <- labl [L] 

} 

} 

} 

## calculate the maximum cluster size 

## either the cluster just constructed by merging the clusters in the 
## neighborhood of the random edge is now the largest cluster, or the 
## largest cluster remains the same 

newlabel <- clus [edge] 

mergedsize <- length(clus [clus == newlabel]) 

if (TT > 1) 
{ 

if (mergedsize > anoth_size [TT-1] ) 

{anoth_size [TT] <- mergedsize}else{anoth_size [TT] <- anoth_size [TT-1] } 
} 

else{ 

if (mergedsize > size [WW] ) 

{anoth_size [1] <- mergedsize}else{anoth_size [1] <- size [WW] > 

} 

} 



s_clussize [ [WW] ] <- anoth_size 
} 

## real output 'size_out' 
size_out <- listO 

size_out [ [1] ] <- c(0,s_clussize [ [M+l] ] ) 
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for(l in 1:M) 
{ 

size_out [ [1+1] ] <- c(size [1] , s_clussize [[1]] ) 
} 

return (size_out) 
} 



## ' sizeprob_modi ' calculates an estimator for the probability of maximum 

## cluster sizes not to exceed a given value k 

## 

## input: 'P_in' , 'P_out' site occupation probabilities inside and outside 
## the subset 

## 'size_out' list from simulation 'run_modi' 

## 

## output: 'estimator' vector of P(max <= k) , k = 0,...,n 

sizeprob_modi <- f unction(size_out ,P_in,P_out) 
{ 

indicator <- listO 
estimator <- c() 

# number of inner sites 
VV <- length(size_out) 
V <- VV - 1 

# number of outer sites 

WW <- length(size_out[[l]]) 
W <- WW - 1 

# total number of sites 
NN <- V + W 

# binomial probability vectors 
out_sites <- c(0:W) 

bin_out <- dbinom(out_sites ,W,P_out) 

in_sites <- c(0:V) 

bin_in <- dbinom(in_sites,V,P_out) 



for(k in 1:NN) 
{ 

estimator [k] <- 
ford in 1:VV) 
{ 

## 'indicator' determines all entries of 'size_out' les 
s or equal than k 

indicator [[i]] <- 0.5*(1 - sign(size_out [ [i] ] - k - 0.5)) 

## 'estimator' is the estimator of P(max <= k) based on the single run 
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estimator [k] <- estimator [k] + bin_in[i] * sum(indicator [ [i] ] *bin_out) 

} 

} 

return (estimator) 

} 
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